Load the necessary libraries
library(mgcv) #for GAMs
library(gratia) #for GAM plots
library(broom) #for tidy output#
library(emmeans) #for marginal means etc
library(MuMIn) #for model selection and AICc
library(tidyverse) #for data wrangling
library(DHARMa) #for simulated residuals
library(performance) #for residual disagnostics
library(see) # to visualize residual diagnostics
In a chapter on time series analysis, Reed et al. (2007) presented Hawaiian longitudinal waterbird survey data. These data comprise winter counts of various species of stilts, coots and moorehen along with year and the previous seasons rainfall. Here, we will explore the temporal patterns in the Kauai Moorhen.
Moorhen
Format of reed.csv data file
| Year | Stilt.Oahu | Stilt.Maui | Coot.Oahu | Coot.Maui | Moorhen.Kauai | Rainfall |
|---|---|---|---|---|---|---|
| 1956 | 163 | 169 | 528 | 177 | 2 | 15.16 |
| 1957 | 272 | 190 | 338 | 273 | NA | 15.48 |
| 1958 | 549 | 159 | 449 | 256 | 2 | 16.26 |
| 1959 | 533 | 211 | 822 | 170 | 10 | 21.25 |
| 1960 | NA | 232 | NA | 188 | 4 | 10.94 |
| 1961 | 134 | 155 | 717 | 149 | 10 1 | 9.93 |
| Year | - a continuous predictor |
| Stilt.Oahu | - the abundance of the Oahu stilt |
| Stilt.Maui | - the abundance of the Maui stilt |
| Coot.Oahu | - the abundance of the Oahu coot |
| Coot.Maui | - the abundance of the Maui coot |
| Moorhen.Kauai | - the abundance of the Kauai moorhen |
| Rainfal | - the number of centimeters (or inches) of rain |
reed = read_csv('../public/data/reed.csv', trim_ws=TRUE)
## Rows: 48 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (7): Year, Stilt.Oahu, Stilt.Maui, Coot.Oahu, Coot.Maui, Moorhen.Kauai, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(reed)
## Rows: 48
## Columns: 7
## $ Year <dbl> 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 19…
## $ Stilt.Oahu <dbl> 163, 272, 549, 533, NA, 134, 175, 356, 485, 184, 242, 20…
## $ Stilt.Maui <dbl> 169, 190, 159, 211, 232, 155, 282, 170, 164, 162, 253, 1…
## $ Coot.Oahu <dbl> 528, 338, 449, 822, NA, 717, 12, 169, 98, 112, 77, 106, …
## $ Coot.Maui <dbl> 177, 273, 256, 170, 188, 149, 205, 108, 79, 53, 75, 80, …
## $ Moorhen.Kauai <dbl> 2, NA, 2, 10, 4, 10, 12, 10, 8, NA, 17, 7, 44, 50, 26, 1…
## $ Rainfall <dbl> 15.16, 15.48, 16.26, 21.25, 10.94, 19.93, 12.63, 20.09, …
Model formula: \[ y_i \sim{} \mathcal{Pois}(\lambda_i)\\ log(\lambda_i) =\beta_0 + f(Year_i) + f(Rainfall_i) \]
where \(\beta_0\) is the y-intercept. \(f(Year)\) and \(f(Rainfall)\) indicate the additive smoothing functions of Year and Rainfall respectively.
We begin with a simple scatterplot of the response (Moorhen.Kauai counts) against our main predictor (Year).
ggplot(reed, aes(y=Moorhen.Kauai, x=Year)) +
geom_point()
Conclusions:
We could put a loess smoother through these data, however, a loess smoother will assume a Gaussian distribution. We could alternatively, explore a GAM smoother with a Poisson distribution. Keep in mind, that whilst this will use GAM in the backend, it will not do all the critical assumption checking and thus is useful as an exploratory data analysis routine only.
ggplot(reed, aes(y=Moorhen.Kauai, x=Year)) +
geom_point() +
geom_smooth(method='gam', formula=y~s(x),
method.args=list(family='poisson'))
We can also do a similar thing for Rainfall.
ggplot(reed, aes(y=Moorhen.Kauai, x=Rainfall)) +
geom_point() +
geom_smooth(method='gam', formula=y~s(x),
method.args=list(family='poisson'))
As we did in the previous example, it might be informative to start off by exploring the spline formulation. This time, we will explore both thin-plate and cubic regression.
basis(s(Year, bs='tp'), data=reed) %>% draw
basis(s(Year, bs='cr'), data=reed) %>% draw
reed.gam1 <- gam(Moorhen.Kauai ~ s(Year, bs = 'cr') +
s(Rainfall, bs = 'cr'),
data = reed,
family = poisson(link = 'log'),
method = 'REML')
reed.gam1a <- gam(Moorhen.Kauai ~ s(Year, bs = 'cr') +
s(Rainfall, bs = 'cr'),
data = reed,
family = poisson(link = 'log'),
method = 'REML',
select=TRUE)
k.check(reed.gam1)
## k' edf k-index p-value
## s(Year) 9 8.585258 0.7764363 0.0650
## s(Rainfall) 9 6.976043 1.3254038 0.9825
Conclusions:
appraise(reed.gam1)
Conclusions:
reed.gam2 = gam(Moorhen.Kauai ~ s(Year, k = 20, bs = 'cr') +
s(Rainfall, bs = 'cr'),
data = reed,
family = poisson(link = 'log'),
method = 'REML')
k.check(reed.gam2)
## k' edf k-index p-value
## s(Year) 19 18.072011 1.294476 0.9725
## s(Rainfall) 9 1.000482 1.124450 0.7575
Conclusions:
appraise(reed.gam2)
Conclusions:
concurvity(reed.gam2)
## para s(Year) s(Rainfall)
## worst 9.291028e-31 0.8621898 0.8621898
## observed 9.291028e-31 0.2529843 0.6275355
## estimate 9.291028e-31 0.2333866 0.4814997
Conclusions:
performance::check_model(reed.gam2)
## Error in data.frame(Term = terms, VIF = result, SE_factor = sqrt(result), : arguments imply differing number of rows: 0, 1
Conclusions:
reed.resids <- simulateResiduals(reed.gam2, plot=TRUE)
testZeroInflation(reed.resids)
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 0, p-value = 1
## alternative hypothesis: two.sided
Conclusions:
In mgcv, there are two methods for fitting negative binomial:
reed.gam3 <- gam(Moorhen.Kauai ~ s(Year, k = 20, bs = 'cr') +
s(Rainfall, bs = 'cr'),
data = reed,
family = nb(link = 'log'),
method = 'REML')
## get the final theta estimate
(theta <- reed.gam3$family$getTheta(TRUE))
## [1] 4.078321
## or with supplied theta
reed.gam3 = gam(Moorhen.Kauai ~ s(Year, k = 20, bs = 'cr') +
s(Rainfall, bs = 'cr'),
data = reed,
family = negbin(link = 'log', theta = theta),
method = 'REML')
k.check(reed.gam3)
## k' edf k-index p-value
## s(Year) 19 7.360630 0.6817874 0.0075
## s(Rainfall) 9 1.452223 1.2007731 0.9075
appraise(reed.gam3)
concurvity(reed.gam3)
## para s(Year) s(Rainfall)
## worst 9.291028e-31 0.8621898 0.8621898
## observed 9.291028e-31 0.2688625 0.6069471
## estimate 9.291028e-31 0.2333866 0.4814997
performance::check_model(reed.gam3)
## Error in data.frame(Term = terms, VIF = result, SE_factor = sqrt(result), : arguments imply differing number of rows: 0, 1
reed.resids <- simulateResiduals(reed.gam3, plot=TRUE)
testZeroInflation(reed.resids)
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 0, p-value = 1
## alternative hypothesis: two.sided
testTemporalAutocorrelation(reed.resids, time=reed.gam3$model$Year)
##
## Durbin-Watson test
##
## data: simulationOutput$scaledResiduals ~ 1
## DW = 1.4999, p-value = 0.08627
## alternative hypothesis: true autocorrelation is not 0
# OR
Year = reed %>%
filter(!is.na(Moorhen.Kauai)) %>%
pull(Year)
testTemporalAutocorrelation(reed.resids, time=Year)
##
## Durbin-Watson test
##
## data: simulationOutput$scaledResiduals ~ 1
## DW = 1.4999, p-value = 0.08627
## alternative hypothesis: true autocorrelation is not 0
plot(acf(residuals(reed.gam3,type='pearson')))
draw(reed.gam3)
draw(reed.gam3, residuals = TRUE, scales = 'free')
#plot(reed.gam3, pages=1)
#plot(reed.gam3, pages=1, shift=coef(reed.gam3)[1])
#plot(reed.gam3, pages=1, shift=coef(reed.gam3)[1], scale=0)
plot(reed.gam3, pages=1, shift=coef(reed.gam3)[1], trans=exp,
resid=TRUE, cex=4, scale=0)
summary(reed.gam3)
##
## Family: Negative Binomial(4.078)
## Link function: log
##
## Formula:
## Moorhen.Kauai ~ s(Year, k = 20, bs = "cr") + s(Rainfall, bs = "cr")
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.81247 0.08035 47.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(Year) 7.361 9.070 164.116 <2e-16 ***
## s(Rainfall) 1.452 1.754 0.475 0.759
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.796 Deviance explained = 81%
## -REML = 214.57 Scale est. = 1 n = 45
Conclusions:
tidy(reed.gam3)
Conclusions:
reed.gam4 <- gam(Moorhen.Kauai ~ s(Year, k = 20, bs = 'cr') +
Rainfall,
data = reed,
family = nb(link = 'log'),
method = 'REML')
(theta <- reed.gam4$family$getTheta(TRUE))
## [1] 4.014475
reed.gam4 <- gam(Moorhen.Kauai ~ s(Year, k = 20, bs = 'cr') +
Rainfall,
data = reed,
family = negbin(link = 'log', theta = theta),
method = 'REML')
AICc(reed.gam3, reed.gam4)
Conclusions:
k.check(reed.gam4)
## k' edf k-index p-value
## s(Year) 19 7.376833 0.6783268 0.01
appraise(reed.gam4)
Conclusions:
reed.resid <- simulateResiduals(reed.gam4, plot=TRUE)
testDispersion(reed.resid)
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 0.43448, p-value = 0.088
## alternative hypothesis: two.sided
testZeroInflation(reed.resid)
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 0, p-value = 1
## alternative hypothesis: two.sided
testTemporalAutocorrelation(reed.resid, time=reed.gam4$model$Year)
##
## Durbin-Watson test
##
## data: simulationOutput$scaledResiduals ~ 1
## DW = 1.4494, p-value = 0.05804
## alternative hypothesis: true autocorrelation is not 0
Year = reed %>%
filter(!is.na(Moorhen.Kauai)) %>%
pull(Year)
testTemporalAutocorrelation(reed.resid, time=Year)
##
## Durbin-Watson test
##
## data: simulationOutput$scaledResiduals ~ 1
## DW = 1.4494, p-value = 0.05804
## alternative hypothesis: true autocorrelation is not 0
Conclusions:
summary(reed.gam4)
##
## Family: Negative Binomial(4.014)
## Link function: log
##
## Formula:
## Moorhen.Kauai ~ s(Year, k = 20, bs = "cr") + Rainfall
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.8011506 0.2248380 16.906 <2e-16 ***
## Rainfall 0.0007641 0.0114932 0.066 0.947
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(Year) 7.377 9.116 163.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.791 Deviance explained = 80.5%
## -REML = 218.02 Scale est. = 1 n = 45
reed.gam5 <- gam(Moorhen.Kauai ~ s(Year, k = 20, bs = 'cr'),
data = reed,
family = nb(link = 'log'),
method = 'REML')
(theta=reed.gam5$family$getTheta(TRUE))
## [1] 4.249445
reed.gam5 <- gam(Moorhen.Kauai ~ s(Year, k = 20, bs = 'cr'),
data = reed,
family = negbin(link = 'log', theta = theta),
method = 'REML')
AICc(reed.gam3, reed.gam4, reed.gam5)
Conclusions:
k.check(reed.gam5)
## k' edf k-index p-value
## s(Year) 19 7.701981 0.6889843 0.0175
appraise(reed.gam5)
Conclusions:
reed.resid <- simulateResiduals(reed.gam5, plot=TRUE)
testDispersion(reed.resid)
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 0.43274, p-value = 0.088
## alternative hypothesis: two.sided
testZeroInflation(reed.resid)
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 0, p-value = 1
## alternative hypothesis: two.sided
Year = reed %>%
filter(!is.na(Moorhen.Kauai)) %>%
pull(Year)
testTemporalAutocorrelation(reed.resid, time=Year)
##
## Durbin-Watson test
##
## data: simulationOutput$scaledResiduals ~ 1
## DW = 1.4501, p-value = 0.0584
## alternative hypothesis: true autocorrelation is not 0
Conclusions:
summary(reed.gam5)
##
## Family: Negative Binomial(4.249)
## Link function: log
##
## Formula:
## Moorhen.Kauai ~ s(Year, k = 20, bs = "cr")
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.81264 0.07898 48.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(Year) 7.702 9.528 171.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.802 Deviance explained = 81.1%
## -REML = 214.49 Scale est. = 1 n = 45
Conclusions:
Whether we chose to display the trends from the model with a smoother for year and a smoother for rainfall (Model 3) or just the model that has the temporal smoother (Model 5), we would probably only be interested in the temporal trends (since the rainfall trends are not significant). Hence, we will only produce the temporal trends.
reed.list <- with(reed, list(Year = modelr::seq_range(Year, n=100)))
newdata <- emmeans(reed.gam3, ~Year, at = reed.list, type = 'response') %>%
as.data.frame
head(newdata)
ggplot(newdata, aes(y = response, x = Year)) +
geom_ribbon(aes(ymin = lower.CL, ymax = upper.CL), fill = 'blue', alpha = 0.3) +
geom_line() +
theme_bw()
We might also want to display the observed values. When there are multiple smoothers, displaying the raw data alongside the modelled trend can be very misleading. This is because whilst the modelled trend illustrates the estimated trend of one predictor holding all other predictors constant (and thus standardising over these other predictors), the raw observations will not be similarly standardised. As a result, the modelled trend may appear inconsistent with the raw observations.
It is therefore better to calcualte partial residuals. For each of the observed values of the focal predictor, we calculate the fitted value to which we then add the residuals. The result are the equivalent of the observed values that have been standardised by the other predictors and these are more consistent with what the model actually saw.
In the following, we will calculate the partial residuals and add them (black points) to the temporal trends along with the raw observed values (red). We would of course not normally add the raw observed values, but here it helps to see what impact the standardisation has.
reed.obs <- with(reed.gam3$model,
data.frame(Year = Year,
Rainfall = mean(Rainfall))) %>%
mutate(Pred = predict(reed.gam3, newdata=., type='link'),
Resid = reed.gam3$residuals,
Presid = exp(Pred + Resid))
## reed.presid <- data.frame(Year = reed.gam3$model$Year,
## Rainfall = mean(reed.gam3$model$Rainfall)) %>%
## mutate(Pred = predict(reed.gam3, newdata=., type='link'),
## Resid = reed.gam3$residuals,
## Presid = exp(Pred + Resid))
head(reed.obs)
ggplot(newdata, aes(y=response, x=Year)) +
geom_ribbon(aes(ymin=lower.CL, ymax=upper.CL), fill='blue', alpha=0.3) +
geom_line() +
geom_point(data=reed.obs, aes(y=Presid)) +
geom_point(data=reed, aes(y = Moorhen.Kauai, x=Year), color='red')
Conclusions:
In this case, as there is only a single predictor, it is fine to use the raw observed values in the plot.
reed.list= with(reed, list(Year=seq(min(Year), max(Year), len=100)))
newdata = emmeans(reed.gam5, ~Year, at=reed.list, type='response') %>%
as.data.frame
head(newdata)
ggplot(newdata, aes(y=response, x=Year)) +
geom_ribbon(aes(ymin=lower.CL, ymax=upper.CL), fill='blue', alpha=0.3) +
geom_line() +
theme_bw() +
geom_point(data=reed, aes(y = Moorhen.Kauai, x=Year))
Conclusions: